Abstract: In this paper we present a novel strategy to
discover the community structure of (possibly large) networks.
The approach is based on the well-known concept of network
modularity optimization. To this end, our algorithm exploits
a novel measure of edge centrality based on K-paths.
This technique makes it possible to efficiently compute an edge
ranking in large networks in near linear time. Once the centrality
ranking is calculated, the algorithm computes the pairwise
proximity between nodes of the network. Finally, it discovers
the community structure by adopting a strategy inspired by the
well-known state-of-the-art Louvain method (henceforth, LM),
efficiently maximizing the network modularity. The experiments
we carried out show that our algorithm outperforms
other techniques and slightly improves on the results of the
original LM, providing reliable results. Another advantage is
that it extends naturally to unweighted networks, unlike the LM.
I. Introduction

The investigation of the community structure inside networks
has gained great relevance in recent years, in particular in
the context of Social Network Analysis (SNA). This is partly
due to the unexpected success of Online Social Networks (OSNs).
In fact, social phenomena such as Facebook and Twitter, among
others, bring together millions of users in a single network
whose features are a goldmine for social scientists. Several
works focus on the Social Network Analysis of these OSNs;
others describe the analysis strategies themselves.

In this paper we focus on the possible strategies of community
detection. To date, two paradigms exist to discover the
community structure of a network. The former is based on the
analysis of global features of the network, for example its
topology. These approaches are characterized by high
computational complexity and high-quality results. The latter
paradigm relies on local information, for example the
information obtainable from nodes and their neighborhoods.
The computational cost of these techniques is lower than that
of techniques exploiting global features, but their reliability
is also lower.

In this work, we propose a novel strategy to discover
the inner community structure of a network. The main
characteristics of our approach are the following: i) it
exploits global information of the network, establishing
which edges contribute to the creation of the community
structure; ii) to do so, it adopts a novel measure of edge
centrality, in order to rank all the edges of the network with
respect to their proclivity to propagate information through
the network itself; iii) its computational cost is low, making
it feasible even for the analysis of large networks; iv) it
produces reliable results, even when compared with those
obtained with more complex techniques, where such a comparison
is possible; in fact, because of computational constraints,
some existing techniques are not viable for large networks,
and their application is limited to small case studies.

This paper is organized as follows: in the next Section we
provide some background information about the community
detection problem. Section III introduces the main objectives
of this work and gives an intuitive sketch of the novel
community detection strategy we propose. Section IV recalls
the key concept of K-path edge centrality, a novel and
efficient strategy for ranking edges with respect to their
centrality in the network. All the pieces are glued together
in Section V, where we describe our strategy to detect
the community structure, inspired by the well-known
state-of-the-art LM [1], which is computationally suitable even
when large networks are analyzed. The experiments that have
been carried out are discussed in Section VI. Finally, Section
VII concludes, outlining some future directions of research.



II. Background 

Several techniques to investigate the community structure
of networks have been proposed in the literature in recent
years. Numerous comprehensive surveys of this problem exist,
such as [2], [3].

In its general formulation, the problem of finding communities
in a network is intended as a data clustering problem.
In fact, it can be solved by assigning each node of the network
to a cluster, in a meaningful way. Two approaches have been
widely investigated: i) spectral clustering based techniques,
and ii) network modularity optimization strategies. The
former relies on optimizing the process of cutting
the graph representing the given network. The latter is based
on the maximization of a benefit function, called network
modularity. We briefly recall them separately.

The problem of minimizing the number of cuts in a
given graph has been proved to be NP-hard. For this reason,
different approximate techniques have been proposed. An
example is spectral clustering [4], which exploits
the eigenvectors of the Laplacian matrix of the network.
We recall that the Laplacian matrix $L$ of a given graph has
components $L_{ij} = k_i \delta(i,j) - A_{ij}$, where $k_i$ is
the degree of node $i$, $\delta(i,j)$ is the Kronecker delta
(that is, $\delta(i,j) = 1$ if and only if $i = j$) and $A_{ij}$
is the adjacency matrix representing the graph connections.
Another approach relies on the strategy of ratio cut
partitioning [5], [6]. This is a function that, if minimized,
allows the identification of large clusters with a minimum
number of outgoing interconnections. The principal issue of
spectral clustering based techniques is that one has to know in
advance the number and the size of the communities in the given
network. This makes the strategy unfeasible if the purpose is
to discover the unknown community structure of a network.
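As a small illustration of the Laplacian recalled above, the following sketch builds the matrix with entries "degree on the diagonal, minus adjacency elsewhere" for an undirected simple graph. The function name `laplacian`, the edge-list input and the three-node example are assumptions made only for this illustration.

```python
def laplacian(n, edges):
    """Return the Laplacian (degree matrix minus adjacency matrix) as a list of rows.

    n: number of nodes, labeled 0 .. n-1; edges: undirected (i, j) pairs, no self-loops.
    """
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] -= 1          # off-diagonal entries are -A_ij
        L[j][i] -= 1
        L[i][i] += 1          # diagonal entries accumulate the degree k_i
        L[j][j] += 1
    return L

# Hypothetical example: the path 0 - 1 - 2.
L = laplacian(3, [(0, 1), (1, 2)])
```

Every row of the result sums to zero, which is the property spectral methods rely on when they inspect the eigenvectors of this matrix.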

The strategy exploited in this paper adopts the second
paradigm, the one relying on the concept of network
modularity. It can be explained as follows: let us consider a
network, represented by means of a graph $G = (V, E)$,
partitioned into $m$ communities; assuming that $l_s$ is the
number of edges between nodes belonging to the $s$-th community
and $d_s$ is the sum of the degrees of the nodes in the $s$-th
community, the network modularity $Q$ is given by

$$Q = \sum_{s=1}^{m} \left[ \frac{l_s}{|E|} - \left( \frac{d_s}{2|E|} \right)^2 \right] \qquad (1)$$

Intuitively, high values of $Q$ imply high values of $l_s$ for
each discovered community; thus, detected communities are
dense within their structure and weakly coupled with each
other. Equation (1) reveals a possible maximization strategy:
in order to increase the value of the first term (namely, the
coverage), the highest possible number of edges should fall
in each given community, whereas the minimization of the
second term is obtained by dividing the network into several
communities with small total degrees.
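The modularity just described (for each community, the fraction of edges falling inside it minus the squared fraction of total degree it holds) can be evaluated with a few lines of code. This is a minimal sketch assuming an undirected, unweighted graph given as an edge list; the name `modularity`, the input layout and the six-node example are illustrative assumptions.

```python
from collections import defaultdict

def modularity(edges, community):
    """Network modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> community label.
    """
    total = len(edges)              # |E|
    internal = defaultdict(int)     # l_s: edges with both endpoints in community s
    degree_sum = defaultdict(int)   # d_s: sum of the degrees of the nodes in s
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[s] / total - (degree_sum[s] / (2 * total)) ** 2
               for s in degree_sum)

# Hypothetical example: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, partition)    # 2 * (3/7 - (7/14)**2) = 5/14
```

Putting all six nodes in one community gives Q = 0, which is why the two-triangle split (Q = 5/14) is preferred by a modularity maximizer.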

The problem of maximizing the network modularity has
been proved to be NP-complete [7]. For this reason, several
heuristic strategies to maximize the network modularity $Q$
have been proposed to date. Probably the most popular one is
the Girvan-Newman strategy [8], [9]. This approach works in
two steps: i) ranking edges by using the betweenness
centrality as a measure of importance; ii) deleting edges in
order of importance, evaluating the increase of the value of
$Q$. In fact, it is possible to maximize the network
modularity by deleting edges with a high value of betweenness
centrality, based on the intuition that they connect nodes
belonging to different communities. The process iterates as
long as a significant increase of $Q$ is obtained. At each
iteration, each connected component of $G$ identifies a
community. Unfortunately, the computational cost of this
strategy is very high (i.e., $O(n^3)$, $n$ being the number of
nodes in $G$). This makes it unsuitable for the analysis of
large networks. The largest part of its cost is due to the
calculation of the betweenness centrality, which is itself
very costly (even if the most efficient algorithm [10] is
adopted).

Several variants of this strategy have been proposed over
the years, such as the fast clustering algorithm provided by
Clauset, Newman and Moore [11], which runs in $O(n \log^2 n)$
on sparse graphs; the extremal optimization method proposed
by Duch and Arenas [12], based on a fast agglomerative
approach, with $O(n^2 \log n)$ time complexity; the
Newman-Leicht [13] mixture model based on statistical
inference; and other maximization techniques by Newman [14]
based on eigenvectors and matrices.

The state-of-the-art technique is called the Louvain method
(LM) [1]. This strategy is based on local information and
is well suited for analyzing large weighted networks. It is
based on two simple steps: i) each node is assigned
to a community chosen in order to maximize the network
modularity $Q$; the gain derived from moving a node $i$ into
a community $C$ can simply be calculated as [1]

$$\Delta Q = \left[ \frac{\Sigma_C + k_i^C}{2m} - \left( \frac{\hat{\Sigma}_C + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_C}{2m} - \left( \frac{\hat{\Sigma}_C}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right] \qquad (2)$$

where $\Sigma_C$ is the sum of the weights of the edges inside
$C$, $\hat{\Sigma}_C$ is the sum of the weights of the edges
incident to nodes in $C$, $k_i$ is the sum of the weights of
the edges incident to node $i$, $k_i^C$ is the sum of the
weights of the edges from $i$ to nodes in $C$, and $m$ is the
sum of the weights of all the edges in the network; ii) the
second step simply builds a new network whose nodes are the
communities previously found. Then the process iterates as
long as a significant improvement of the network modularity
is obtained.
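To make the gain computation concrete, the sketch below evaluates the Louvain-style gain term by term (the modularity contribution of C with node i added, minus the contributions of C and of the isolated node i) and compares it with the algebraically simplified form. The function names and the numeric example are assumptions for illustration; note also that presentations of this gain differ on whether edges inside a community are counted once or twice, so the absolute value should be read with that convention in mind.

```python
def modularity_gain(sigma_in, sigma_tot, k_i, k_i_in, m):
    """Gain of moving an isolated node i into community C, written term by term.

    sigma_in : sum of the weights of the edges inside C
    sigma_tot: sum of the weights of the edges incident to nodes in C
    k_i      : sum of the weights of the edges incident to node i
    k_i_in   : sum of the weights of the edges from i to nodes in C
    m        : sum of the weights of all the edges in the network
    """
    after = (sigma_in + k_i_in) / (2 * m) - ((sigma_tot + k_i) / (2 * m)) ** 2
    before = sigma_in / (2 * m) - (sigma_tot / (2 * m)) ** 2 - (k_i / (2 * m)) ** 2
    return after - before

def modularity_gain_simplified(sigma_tot, k_i, k_i_in, m):
    """Same quantity after cancelling the terms that do not depend on the move."""
    return k_i_in / (2 * m) - sigma_tot * k_i / (2 * m * m)

# Hypothetical numbers: a community with 3 internal edges and total degree 7,
# and a node of degree 3 that has a single edge towards that community.
gain = modularity_gain(sigma_in=3, sigma_tot=7, k_i=3, k_i_in=1, m=7)
```

The simplified form shows why the computation is cheap: only the weight from the node to the candidate community and the community's total incident weight are needed.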

In this paper we present an efficient community detection
algorithm which represents a generalization of the LM.
In fact, it can be applied even to unweighted networks
and, most importantly, it exploits both global and local
information. To make this possible, our strategy computes
the pairwise distance between nodes of the network. To do
so, edges are weighted by using a global feature which
represents their aptitude to propagate information through
the network. The edge weighting is based on the K-path
edge centrality, a novel measure whose calculation requires
a near linear computational cost [15]. The partition of
the network is then obtained by improving the LM. Details of
our strategy are explained in the following.

III. Design Goals 

In this Section we briefly and informally discuss the ideas 
behind our strategy. First of all, we explain the principal 
motivations that make our approach suitable, in particular 
but not only, for the analysis of the community structure of 
Social Networks. To this purpose, we introduce a real-life 
example from which we infer some features of our approach. 

Let us consider a social network in which users are connected
to each other by friendship relations. In this context,
we can assume that one of the principal activities is
exchanging information. Thus, let us assume that a "message"
(which could be, for example, a wall post on Facebook or a
tweet on Twitter) represents the simplest "piece" of
information and that users of this network can exchange
messages by means of their connections. This means that a user
can directly send and receive information only to/from the
people in her neighborhood. This assumption will be
fundamental (see further) in order to define the concepts of
community and community structure. Intuitively, a
community is defined as a group of individuals in which the
interconnections are denser than outside the group (in fact,
this maximizes the benefit function $Q$).

The aim of our community detection algorithm is to
identify the partitioning of the network into communities such
that the network modularity is optimal. To do so, our strategy
is to rank the links of the network on the basis of their
aptitude to favor the diffusion of information. In detail, the
higher the ability of an edge to propagate a message, the
higher its centrality in the network. This is important
because, as already shown in [8], [9], the higher
the centrality of an edge, the higher the probability that it
connects different communities.

Our algorithm adopts different optimizations in order to
efficiently compute the link ranking. Once we define an
optimized strategy for ranking links, we can compute the
pairwise distances between nodes and finally the partitioning
of the network, according to the LM. The quality of the
partitioning into communities is evaluated by means of the
network modularity $Q$.

In the next sections we shall discuss how our algorithm
incorporates these requirements. First of all, in Section
IV we formally provide a definition of centrality of edges
in social networks based on the propagation of messages by
using simple random walks of length at most $\kappa$ (called,
hereafter, K-path edge centrality). Then, we provide a
description of an efficient implementation of this algorithm,
running in $O(\kappa|E|)$, where $|E|$ is the number of edges
in the network. After this, in Section V we discuss the
technical details of our community detection algorithm.



IV. K-Path Edge Centrality



The concept of K-path edge centrality has been recently
defined [15] as follows:

Definition 1: (K-path edge centrality) For each edge $e$ of
a graph $G = (V, E)$, the K-path edge centrality $L^\kappa(e)$
of $e$ is defined as the sum, over all possible source nodes
$s$, of the percentage of times that a message originated from
$s$ traverses $e$, assuming that the message traversals are
only along random simple paths of at most $\kappa$ edges.

The K-path edge centrality is formalized, for an arbitrary
edge $e$, as follows

$$L^\kappa(e) = \sum_{s \in V} \frac{\sigma_s^\kappa(e)}{\sigma_s^\kappa} \qquad (3)$$

where $s$ ranges over all the possible source nodes,
$\sigma_s^\kappa(e)$ is the number of K-paths originating from
$s$ and traversing the edge $e$, and $\sigma_s^\kappa$ is the
number of K-paths originating from $s$.
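For very small graphs the definition can be checked by brute force: enumerate every simple path that starts at each source, count how many traverse each edge, and sum the resulting fractions over the sources. The sketch below does this under two assumptions that are not stated in the text: the graph is undirected, and a "K-path" is any simple path with between 1 and `k` edges (`k` stands for the maximum path length). The function name is hypothetical, and the exponential cost makes it useful only for checking small examples.

```python
from collections import defaultdict

def kpath_edge_centrality(edges, k):
    """Exact K-path edge centrality by enumerating simple paths (small graphs only)."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    def paths_from(path):
        """Yield every simple path that extends `path` and has at most k edges."""
        if len(path) - 1 == k:
            return
        for w in neighbors[path[-1]]:
            if w not in path:
                extended = path + [w]
                yield extended
                yield from paths_from(extended)

    centrality = {frozenset(e): 0.0 for e in edges}
    for s in neighbors:
        through = defaultdict(int)   # number of paths from s traversing each edge
        total = 0                    # number of paths from s
        for path in paths_from([s]):
            total += 1
            for a, b in zip(path, path[1:]):
                through[frozenset((a, b))] += 1
        for e, count in through.items():
            centrality[e] += count / total
    return centrality

# Hypothetical example: the path 0 - 1 - 2 with paths of at most 2 edges.
centrality = kpath_edge_centrality([(0, 1), (1, 2)], 2)
```

On this three-node path both edges receive the same value, as expected from the symmetry of the graph.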

A. Fast K-path Edge Centrality Algorithm 

In this section we recall the strategy adopted to
efficiently compute the K-path edge centrality.
The proposed algorithm [15] is called Weighted Edge Random
Walk K-Path Centrality (or, shortly, WERW-Kpath).
It consists of two main steps: i) node and edge weights
assignment, and ii) simulation of message propagations
using random simple paths of length at most $\kappa$. In the
following, the two steps are discussed separately.

1) Step 1: Node and edge weights assignment 

In the first stage, the algorithm assigns a weight to both
nodes and edges of the graph $G = (V, E)$ representing the
given network. Node weights are exploited to choose the
source nodes from which the simulation of the message
propagations starts; edge weights represent initial values of
centrality and they are updated during the execution of the
algorithm. At the end of the execution of $\rho$ simulations,
where the optimal value $\rho = |E| - 1$ has been proved in
[15], edge weights are exploited for the edge ranking.

To compute node weights, we recall the notion of local
effective density $\delta(v)$ of a node $v$, as follows:

Definition 2: (Local effective density) Given a graph
$G = (V, E)$ and a node $v$, its local effective density
$\delta(v)$ is

$$\delta(v) = \frac{|I(v)| + |O(v)|}{2|E|}$$

where $I(v)$ and $O(v)$ represent, respectively, the sets of
ingoing and outgoing edges incident on the node $v$.

This value intuitively represents how much a node
contributes to the overall connectivity of the graph. The
higher $\delta(v)$, the better $v$ is connected in the graph.

As for edge weights, we recall the following definition:

Definition 3: (Initial edge weight) Given a graph
$G = (V, E)$ and an edge $e$, its initial edge weight
$\omega_0(e)$ is

$$\omega_0(e) = \frac{1}{|E|}$$

where $|E|$ is the cardinality of $E$.

Intuitively, we initially manage a "budget" consisting of
$|E|$ points; these points are equally divided among all the
edges; the amount of points received by each edge represents
its initial rank.
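The first step can be sketched as follows, assuming the graph is given as a list of directed (source, target) pairs with no repeated pairs. The function name `initial_weights` and the four-edge example are illustrative assumptions; the formulas are the local effective density and the initial edge weight recalled above.

```python
from collections import Counter

def initial_weights(edges):
    """Return (density, weight): node densities and initial edge weights.

    edges: list of distinct directed (source, target) pairs.
    """
    m = len(edges)                               # |E|
    outgoing = Counter(u for u, _ in edges)      # |O(v)| for every node
    ingoing = Counter(v for _, v in edges)       # |I(v)| for every node
    nodes = set(outgoing) | set(ingoing)
    density = {v: (ingoing[v] + outgoing[v]) / (2 * m) for v in nodes}
    weight = {e: 1 / m for e in edges}           # every edge starts at 1/|E|
    return density, weight

# Hypothetical example with three nodes and four directed edges.
density, weight = initial_weights([(0, 1), (1, 2), (2, 0), (0, 2)])
```

Because every edge contributes once as an outgoing and once as an ingoing edge, the densities computed this way sum to 1 over all nodes.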

2) Step 2: Simulation of message propagations 

In the second step we simulate $\rho$ simple random walks of
length at most $\kappa$ on the network. In detail, at each
iteration, WERW-Kpath (Algorithm 1) performs these operations:

1) A node $v$ of the graph $G$ is selected with a probability
proportional to its local effective density $\delta(v)$

$$P(v) = \frac{\delta(v)}{\sum_{u \in V} \delta(u)} \qquad (4)$$

where $\sum_{u \in V} \delta(u)$ is a normalization factor.

2) All the edges in $G$ are marked as not traversed.

3) The procedure MessagePropagation is invoked.
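The source selection in operation 1 is a weighted draw. A minimal sketch, assuming the densities are already available as a dictionary (the helper names and the example values are hypothetical), uses the standard library's `random.choices`, which accepts unnormalized weights:

```python
import random

def selection_probabilities(density):
    """Normalize local effective densities into the selection probabilities P(v)."""
    norm = sum(density.values())
    return {v: d / norm for v, d in density.items()}

def pick_source(density, rng=random):
    """Draw one source node with probability proportional to its density."""
    nodes = list(density)
    return rng.choices(nodes, weights=[density[v] for v in nodes], k=1)[0]

# Hypothetical densities for three nodes.
example_density = {"a": 0.375, "b": 0.25, "c": 0.375}
probabilities = selection_probabilities(example_density)
source = pick_source(example_density, random.Random(0))
```

Passing a seeded `random.Random` instance makes a simulation repeatable, which is convenient when comparing rankings across runs.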

Algorithm 1 WERW-Kpath(Graph G = (V, E), int $\kappa$)

Assign each node $v$: $\delta(v) = \frac{|I(v)| + |O(v)|}{2|E|}$
Assign each edge $e$: $\omega(e) = \frac{1}{|E|}$
$\rho \leftarrow |E| - 1$
for $i = 1$ to $\rho$ do
    $N \leftarrow$ a counter to check the length of the K-path
    $v \leftarrow$ a node chosen according to Eq. (4)
    MessagePropagation($v$, $N$, $\kappa$)



Let us describe the procedure MessagePropagation
(Algorithm 2). This procedure carries out a loop as long as
both the following conditions hold true:

• The length of the path currently generated is no greater
than $\kappa$. This is managed through a length counter $N$.

• Assuming that the walk has reached the node $v_n$, there
must exist at least one outgoing edge from $v_n$ which
has not already been traversed. In detail, we attach a
flag $T(e)$ to each edge $e$; $T(e) = 1$ if the edge $e$ has
already been traversed, $0$ otherwise. If we call $O(v_n)$
the set of outgoing edges from $v_n$, it must hold that
$|O(v_n)| > \sum_{e \in O(v_n)} T(e)$.

The former condition allows us to consider only paths up
to length $\kappa$. The latter condition, instead, prevents the
message from getting trapped in a cycle.

If the conditions above are satisfied, the MessagePropagation
procedure selects an edge $e_n$ with a probability
proportional to the edge weight $\omega(e_n)$, given by

$$P(e_n) = \frac{\omega(e_n)}{\gamma} \qquad (5)$$

where $\gamma = \sum_{\hat{e}_n \in \hat{O}(v_n)} \omega(\hat{e}_n)$
is a normalization factor, with
$\hat{O}(v_n) = \{e_n \in O(v_n) \mid T(e_n) = 0\}$.

Let $e_n$ be the selected edge and let $v_{n+1}$ be the node
reached from $v_n$ by means of $e_n$. The MessagePropagation
procedure awards a bonus (equal to $\beta = \frac{1}{|E|}$) to
$e_n$, sets $T(e_n) = 1$ and increases the counter $N$ by 1.
The message propagation activity continues from $v_{n+1}$.

At the end of all the simulations of message
propagation, each edge $e \in E$ is assigned a centrality value
$L^\kappa(e)$ (in the interval $[\frac{1}{|E|}, 1]$) equal to
its final weight $\omega(e)$.

The time complexity of this algorithm is $O(\kappa|E|)$. Our
community detection strategy, described in the following,
adopts this algorithm to weight the edges of the network.



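Algorithm 2 MessagePropagation(Node $v$, int $N$, int $\kappa$)

while $N < \kappa$ and $|O(v)| > \sum_{e \in O(v)} T(e)$ do
    $e_{vw} \leftarrow$ an edge $e \in \{O(v) \mid T(e) = 0\}$, chosen according to Eq. (5)
    $\omega(e_{vw}) \leftarrow \omega(e_{vw}) + \beta$
    $T(e_{vw}) \leftarrow 1$; $v \leftarrow w$; $N \leftarrow N + 1$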
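The two procedures can be combined into a short runnable sketch. It is an illustration under stated assumptions rather than the reference implementation: the graph is taken to be undirected, simple and non-empty (so each incident edge counts as outgoing and the source is drawn proportionally to the degree), the parameter `k` plays the role of the maximum path length, and the name `werw_kpath` and the `seed` argument are hypothetical.

```python
import random
from collections import defaultdict

def werw_kpath(edges, k, seed=None):
    """Simulate |E| - 1 weighted random walks of at most k edges; return edge weights."""
    rng = random.Random(seed)
    m = len(edges)
    beta = 1.0 / m                                    # bonus awarded per traversal
    weight = {frozenset(e): 1.0 / m for e in edges}   # initial edge weights
    incident = defaultdict(list)                      # node -> [(edge, other endpoint)]
    for u, v in edges:
        incident[u].append((frozenset((u, v)), v))
        incident[v].append((frozenset((u, v)), u))
    nodes = list(incident)
    # Undirected assumption: the density of a node is proportional to its degree.
    degrees = [len(incident[v]) for v in nodes]

    for _ in range(m - 1):
        v = rng.choices(nodes, weights=degrees, k=1)[0]
        traversed = set()                             # edges whose flag T(e) is 1
        steps = 0                                     # the length counter N
        while steps < k:
            free = [(e, w) for e, w in incident[v] if e not in traversed]
            if not free:
                break                                 # every edge at v already used
            e, w = rng.choices(free, weights=[weight[e] for e, _ in free], k=1)[0]
            weight[e] += beta
            traversed.add(e)
            v, steps = w, steps + 1
    return weight

# Hypothetical example: two triangles joined by the edge (2, 3).
graph = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
ranking = werw_kpath(graph, k=3, seed=0)
```

Since an edge is rewarded at most once per walk and there are |E| - 1 walks, every final weight stays between 1/|E| and 1, matching the interval stated above.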



V. Community Structure Detection 

In the following, we present a novel algorithm to calculate
the community structure of a network. It is called Fast
K-path Community Detection (or, shortly, FKCD). The strategy
relies on three steps: i) ranking edges by using the
WERW-Kpath algorithm; ii) calculating the proximity (the
inverse of the distance) between each pair of connected nodes;
iii) partitioning the network into communities so as to
optimize the network modularity [8], according to the LM [1].
The algorithm is discussed as follows.

A. Fast K-path Community Detection 

First of all, our Fast K-path Community Detection
(henceforth, FKCD) needs a ranking criterion to compute the
aptitude of all the edges to propagate information through
the network. To do so, FKCD invokes the WERW-Kpath
algorithm, previously described. Once all the edges have
been labeled with their K-path edge centrality, a ranking
in decreasing order of centrality can be obtained. This is
not fundamental, but could be useful in some applications.
Similarly, before proceeding, a first estimate of the network
modularity (hereafter, $Q$) can be calculated. This helps
to highlight how $Q$ increases during the next steps. With
respect to $Q$, we recall that its value is at most 1, that
it typically lies in the interval [0, 1] for meaningful
partitions and that, the higher $Q$, the more evident the
community structure of the network. The computational
cost of this first step is $O(\kappa|E|)$, with $\kappa$ the
length of the K-paths and $|E|$ the cardinality of $E$.

The second step consists in calculating the proximity
between each pair of connected nodes. This is done by using
an $L_2$ distance (i.e., the Euclidean distance) $r_{ij}$
between nodes $i$ and $j$, calculated as

$$r_{ij} = \sqrt{\sum_{k=1}^{n} \frac{\left( L^\kappa(e_{ik}) - L^\kappa(e_{kj}) \right)^2}{d(k)}} \qquad (6)$$

where $L^\kappa(e_{ik})$ (resp., $L^\kappa(e_{kj})$) is the
K-path edge centrality of the edge $e_{ik}$ (resp., $e_{kj}$)
and $d(k)$ is the degree of the node $k$. We point out that,
even though the $L_2$ measure would return a distance, in our
case, the higher $L^\kappa(e_{ik})$ (resp., $L^\kappa(e_{kj})$),
the nearer the nodes are, rather than the more distant. This
important aspect leads us to consider the results of
Equation (6) as the pairwise proximities of nodes. This step is
theoretically computationally expensive, because it would
require $O(|V|^2)$ iterations, but in practice, by adopting
optimization techniques, its cost is near linear, namely
$O(\bar{d}(v)|V|)$, where $\bar{d}(v)$ is the mean degree of
the nodes of the network (and it is usually small in Social
Networks).
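A possible way to evaluate this quantity for one pair of nodes is sketched below. It assumes an undirected graph, treats the centrality of a non-existent edge as 0, and skips intermediate nodes adjacent to neither endpoint because their term vanishes; these choices, the function name `proximity` and the example values are assumptions made for illustration.

```python
import math

def proximity(i, j, centrality, neighbors):
    """Degree-normalized L2 value between the centrality profiles of nodes i and j.

    centrality: dict frozenset({u, v}) -> K-path edge centrality (missing edges count as 0)
    neighbors : dict node -> set of adjacent nodes
    """
    total = 0.0
    # A node adjacent to neither i nor j contributes (0 - 0)^2 = 0, so it is skipped.
    for k in neighbors[i] | neighbors[j]:
        l_ik = centrality.get(frozenset((i, k)), 0.0)
        l_kj = centrality.get(frozenset((k, j)), 0.0)
        total += (l_ik - l_kj) ** 2 / len(neighbors[k])
    return math.sqrt(total)

# Hypothetical example: the path 0 - 1 - 2 with made-up centrality values.
example_neighbors = {0: {1}, 1: {0, 2}, 2: {1}}
example_centrality = {frozenset((0, 1)): 0.5, frozenset((1, 2)): 0.25}
r_01 = proximity(0, 1, example_centrality, example_neighbors)
```

Restricting the sum to the neighborhoods of the two endpoints is one way to obtain a cost that grows with the mean degree rather than with the number of nodes.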



The last step is the network partitioning. The main idea is
inspired by the LM [1] for detecting the community structure
of weighted networks in near linear time. The partitioning
is an iterative process. At each iteration, two simple steps
occur: i) each node is assigned to a community chosen in
order to maximize the network modularity $Q$; the possible
increase of $Q$ derived from moving a node $i$ into
a community $C$ is calculated according to Equation (2); ii)
the second step produces a meta-network whose nodes are
the communities previously found. The partitioning ends
when no further improvement of $Q$ can be obtained.
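The first of the two steps can be illustrated with a deliberately naive sketch: starting from singleton communities, each node is tentatively moved to the community of each neighbor, and the move giving the largest strict increase of the weighted modularity is kept. For clarity the modularity is recomputed from scratch instead of using the incremental gain, and the meta-network step is omitted. The function names, the comparable integer node labels and the unit-weight example are assumptions of this illustration.

```python
from collections import defaultdict

def weighted_modularity(weights, community):
    """Modularity of a partition of an undirected weighted graph.

    weights: dict (u, v) -> positive weight, one entry per undirected edge.
    """
    m = sum(weights.values())
    inside = defaultdict(float)     # total weight of the edges inside each community
    strength = defaultdict(float)   # total weighted degree of each community
    for (u, v), w in weights.items():
        strength[community[u]] += w
        strength[community[v]] += w
        if community[u] == community[v]:
            inside[community[u]] += w
    return sum(inside[c] / m - (strength[c] / (2 * m)) ** 2 for c in strength)

def local_moving(weights):
    """Greedy first phase: move nodes to neighboring communities while Q improves."""
    neighbors = defaultdict(set)
    for u, v in weights:
        neighbors[u].add(v)
        neighbors[v].add(u)
    community = {v: v for v in neighbors}           # start from singletons
    best_q = weighted_modularity(weights, community)
    improved = True
    while improved:
        improved = False
        for node in sorted(neighbors):
            home = community[node]
            best_label = home
            for label in sorted({community[n] for n in neighbors[node]} - {home}):
                community[node] = label             # tentative move
                q = weighted_modularity(weights, community)
                if q > best_q + 1e-12:
                    best_q, best_label = q, label
            community[node] = best_label            # keep the best move, if any
            if best_label != home:
                improved = True
    return community, best_q

# Hypothetical example: two triangles joined by the edge (2, 3), unit weights.
unit_weights = {e: 1.0 for e in
                [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]}
found, q_found = local_moving(unit_weights)
```

In an FKCD-style pipeline the unit weights of this example would be replaced by the proximities derived from the K-path centralities, which is how global information enters the otherwise local procedure.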

This results in splitting communities connected by edges
with high proximity, which is a global feature, thus
maximizing the network modularity. Its cost is $O(\gamma|V|)$,
where $|V|$ is the cardinality of $V$ and $\gamma$ is the number
of iterations required by the algorithm to converge (in our
experience, usually, $\gamma < 5$). The FKCD is schematized in
Algorithm 3.

We recall that CalculateDistance computes the pairwise
node distance by using Equation (6), Partition extracts the
communities according to the LM described above, and
NetworkModularity calculates the value of network modularity
by using Equation (1).

The computational cost of our strategy is near linear. In
fact, $O(\kappa|E| + \bar{d}(v)|V| + \gamma|V|) = O(\Gamma|E|)$,
by adopting an efficient in-memory graph representation in
order to minimize the execution time for the computation of
Equations (1) and (6).

Algorithm 3 FKCD(Graph G = (V, E), int $\kappa$)

WERW-Kpath($G$, $\kappa$)
CalculateDistance($G$)
while $Q$ increases by at least $\epsilon$ (arbitrarily small) do
    $P \leftarrow$ Partition($G$)
    $Q \leftarrow$ NetworkModularity($P$)



we generated six networks by varying the mixing parameter
$\mu = 0.1, 0.2, \ldots, 0.6$.
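The benchmark grid described for the synthetic networks (N = 1000 nodes, four exponent pairs, three average degrees and six mixing parameters) can be enumerated programmatically before handing each setting to a benchmark generator. The sketch below only builds the list of settings; the dictionary keys and constant names are hypothetical and no generator is invoked.

```python
from itertools import product

N = 1000
EXPONENT_PAIRS = [(2, 1), (2, 2), (3, 1), (3, 2)]    # (degree exponent, size exponent)
AVERAGE_DEGREES = [15, 20, 25]
MIXING = [round(0.1 * i, 1) for i in range(1, 7)]    # 0.1, 0.2, ..., 0.6

configurations = [
    {"N": N, "gamma": g, "beta": b, "avg_degree": k, "mu": mu}
    for (g, b), k, mu in product(EXPONENT_PAIRS, AVERAGE_DEGREES, MIXING)
]
```

Under this reading of the setup the grid contains 4 x 3 x 6 = 72 settings.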

Figure 1 highlights the quality of the obtained results.
The measure adopted is the normalized mutual information
[17]. The values obtained show that our strategy achieves
fairly good results, avoiding the well-known effect
due to the resolution limit of modularity optimization
[18]. Moreover, a classification of results as in Table I
(discussed later) is omitted because the values of $Q$ obtained
by using FKCD and the LM in the case of these quite small
synthetic networks are very similar.
Figure 1. Test of modularity optimization using the benchmark [16], for N
= 1000 nodes. The threshold value $\mu = 0.5$ represents the border beyond
which communities are no longer defined in the strong sense, i.e., each
node has more neighbors in its own community than in the others [19].
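For reference, the normalized mutual information between a detected partition and the planted one can be computed as below. This sketch assumes the common normalization 2·I(A;B) / (H(A) + H(B)); the exact variant used in the cited work is not reproduced in the text, so the normalization, the function name and the example labels should be read as assumptions.

```python
import math
from collections import Counter

def normalized_mutual_information(labels_a, labels_b):
    """NMI between two partitions given as equal-length lists of community labels."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    mutual = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in joint.items())
    denominator = entropy(count_a) + entropy(count_b)
    # Two trivial one-community partitions are identical; report 1 by convention.
    return 1.0 if denominator == 0 else 2 * mutual / denominator

# Hypothetical labels: the same split with renamed communities, then an unrelated split.
same = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The measure is invariant to how communities are named, which is why it suits comparisons between a detected partition and a ground-truth one.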



VI. Experimental Results 

Our experimentation has been conducted on both synthetic
and real-world online social networks, whose datasets are
available online. All the experiments have been carried out
by using a standard personal computer equipped with an Intel
i5 processor and 4 GB of RAM.

A. Synthetic Networks 

The method proposed to evaluate the quality of the
community structure detected by using the FKCD exploits
the technique presented by Lancichinetti et al. [16]. We
generated the same synthetic networks reported in [16],
adopting the following configuration: i) N = 1000 nodes;

ii) the four pairs of networks identified by $(\gamma, \beta) =
(2, 1), (2, 2), (3, 1), (3, 2)$, where $\gamma$ represents the
exponent of the power law distribution of node degrees and
$\beta$ the exponent of the power law distribution of the
community sizes;

iii) for each pair of exponents, three values of average degree
$\langle k \rangle = 15, 20, 25$; iv) for each of the combinations above,



B. Real-world Networks 

Results obtained by analyzing several real-world networks
[20], [21] are summarized in Table I. This experimentation
has been carried out to qualitatively analyze the performance
of our strategy. The results, measured by means of the
network modularity calculated by our algorithm (FKCD), are
compared against those obtained by using the original LM.

Our analysis highlights the following observations:
i) classic non-optimized algorithms (for example
Girvan-Newman [8]) are unfeasible for large network analysis;
ii) the results obtained by using LM are slightly higher
than those obtained by using FKCD; on the other hand, LM
adopts local information in order to optimize the network
modularity, while our strategy exploits both local and global
information; this results in (possibly) more convenient
community structures for some applications; iii) the
performance of FKCD slightly increases when using longer
K-paths; iv) both the compared efficient strategies are
feasible even when analyzing large networks using standard
computing resources (i.e., a classic personal computer); this
aspect is important if we consider that there exist several
Social Network Analysis tools (e.g., NodeXL [22]) that require
optimized fast algorithms to compute the community structure
of networks.

Table I
Datasets adopted in our experimentation

Network      No. nodes   No. edges   No. comm.   Q (FKCD, $\kappa = 5$)   Q (FKCD, $\kappa = 20$)   Q (LM)
CA-GrQc          5,242      28,980         883   0.734                    0.786                     0.816
CA-HepTh         9,877      51,971       1,501   0.585                    0.648                     0.768
CA-HepPh        12,008     237,010       1,243   0.565                    0.598                     0.659
CA-AstroPh      18,772     396,160       1,552   0.486                    0.568                     0.628
CA-CondMat      23,133     186,932       2,819   0.546                    0.599                     0.731
Facebook        63,731   1,545,684       6,484   0.414                    0.444                     0.634



VII. Conclusions 

The problem of discovering the community structure in
large networks has been widely investigated in recent
years. Several efficient approaches based on local
information have been proposed, and are feasible even when
analyzing large networks because of their low computational
cost. The main drawback of the existing techniques is that
they do not consider global information about the topology
of the network. In this work we presented a novel strategy
that has two advantages. The former is that it exploits both
local and global information. The latter is that, by using
some optimizations, it efficiently provides good results.

This way, our approach is able to discover the community
structure in possibly large networks. Our experimental
evaluation, carried out over both synthetic and real-world
networks, shows the efficiency and the robustness of the
proposed strategy. Some future directions of research include
the creation of a friendship recommender system which
suggests new possible connections to the users of a Social
Network, based on the communities they belong to. Finally,
we plan to design an algorithm to estimate the strength of
ties between two social network users: for instance, in the
case of networks like Facebook, this is equivalent to
estimating the friendship degree between a pair of users.